library(Seurat)
library(tidyverse)
seurat_sketch <- readRDS("data/seurat_sketch.RDS")Identifying Unwanted Variation
Approximate time: XY minutes
Learning Objectives
- LO 1
- Perform clustering of cells based on significant principal components
- LO3
Single-Cell RNA-seq (scRNA-seq) Workflow
We will be performing a standard scRNA-seq workflow on our sketch assay. For more explicit details on each one of these steps, we have materials on how to conduct a scRNA-seq analysis from start to end.
Here we show the overarching steps that are takin in a classical single-cell analysis.
The broader principles of this workflow remain the same when working with spatial transcriptomics data. This lesson will cover the following steps:
FindVariableFeatures(): As before, this generates a list of highly variable genes, which may be slightly different for the sketched dataset than for the full datasetScaleData(): Highly variable genes will be confounded with the most highly expressed genes, so we need to adjust for thisRunPCA(): Perform a principal component analysis using our scaled data and variable genes. This will emphasize variation in gene expression as well as similarity across binsFindNeighbors(): Determine the Euclidean distance between bins in PCA spaceFindClusters(): Iteratively group bins together based on neighborhood distances. Higher resolution will yield more groups.RunUMAP(): TODO
With that being said, there are certain elements that are unique to a spatial dataset that we will highlight as we go through all of these steps.
Variation in the Dataset
Identify Highly Variable Features
Recall that when we were working with the entire dataset, we found the following genes to be the most variable:
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_sketch, assay="Spatial.008um")
top_genes <- ranked_variable_genes[1:15]
top_genes [1] "IGHG3" "IGHM" "SPP1" "IGLC7" "IGLC1" "SST" "IGHG1" "INSL5" "CHGA"
[10] "VIP" "HBA2" "GCG" "CXCL8" "MMP12" "PYY"
We are going to run FindVariableFeatures() once more, but this time on our smaller dataset of 10,000 bins.
# Identify the most variable genes
seurat_processed <- FindVariableFeatures(seurat_sketch,
selection.method = "vst",
nfeatures = 2000,
assay="sketch")
seurat_processedAn object of class Seurat
36170 features across 826613 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
2 layers present: counts, data
1 other assay present: Spatial.008um
2 spatial fields of view present: P5CRC.008um P5NAT.008um
Do you seen any differences when we calculated HVGs for the entire dataset versus the sketch assay?
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed, assay="sketch")
ranked_variable_genes[1:15] [1] "IGHM" "IGHG1" "IGLC1" "CHGA" "VIP" "CXCL8" "SPP1" "GCG" "IGHG3"
[10] "IGHA1" "PYY" "REG3A" "HBA2" "INSL5" "ODAM"
Visualize Highly Variable Genes
We can visualize the dispersion of all genes using Seurat’s VariableFeaturePlot(), which shows a gene’s average expression across all cells on the x-axis and variance on the y-axis. Ideally we would see genes lay across the entire x-axis, as we want to ensure that we are using genes that are both lowly and highly expressed.
For the visualization, we are adding labels using LabelPoints() to help us better understand which genes are driving the shape of our data.
# Identify the 15 most highly variable genes
ranked_variable_genes <- VariableFeatures(seurat_processed,
assay = "sketch")
top_genes <- ranked_variable_genes[1:15]
# Plot the average expression and variance of these genes
# With labels to indicate which genes are in the top 15
p <- VariableFeaturePlot(seurat_processed)
LabelPoints(plot = p, points = top_genes, repel = TRUE)Now we are primed to run the next step of the workflow, which is Principal Component Analysis.
PCA
Principal Component Analysis (PCA) is a technique used to emphasize variation as well as similarity, and to bring out strong patterns in a dataset; it is one of the methods used for “dimensionality reduction”.
For a more detailed explanation on PCA, please look over this lesson (adapted from StatQuests/Josh Starmer’s YouTube video). We also strongly encourage you to explore the video StatQuest’s video for a more thorough understanding.
Let’s say you are working with a single-cell RNA-seq dataset with 12,000 cells and you have quantified the expression of 20,000 genes. The schematic below demonstrates how you would go from a cell x gene matrix to principal component (PC) scores for each individual cell.
To perform PCA, we need to first choose the most variable features, then scale the data. Since highly expressed genes exhibit the highest amount of variation and we don’t want our ‘highly variable genes’ only to reflect high expression, we need to scale the data to scale variation with expression level. The Seurat ScaleData() function will scale the data by:
- Adjusting the expression of each gene to give a mean expression across cells to be 0
- Scaling expression of each gene to give a variance across cells to be 1
# Scale the data counts
seurat_processed <- ScaleData(seurat_processed)
seurat_processedAn object of class Seurat
36170 features across 826613 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.008um
2 spatial fields of view present: P5CRC.008um P5NAT.008um
We have calculated a new layer titled sale.data, where our scaled log-normalized counts now exist. So now we can proceed in calculating our Principal Components with RunPCA().
# Run PCA
seurat_processed <- RunPCA(seurat_processed,
assay = "sketch",
reduction.name = "pca.sketch")The information printed out show the top genes that influence a cell’s score for each component. This includes genes that have both a positive and negative weight.
If we look at our updated seurat_processed object:
seurat_processedAn object of class Seurat
36170 features across 826613 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.008um
1 dimensional reduction calculated: pca.sketch
2 spatial fields of view present: P5CRC.008um P5NAT.008um
We see that we have stored our PCA results: dimensional reduction calculated: pca.sketch
Visualize PCA
After the PC scores have been calculated, you are looking at a matrix of 12,000 x 12,000 that represents the information about relative gene expression in all the cells. You can select the PC1 and PC2 columns and plot that in a 2D way.
DimPlot(seurat_processed,
reduction = "pca.sketch")Ultimately what we have calculated with PCA are values that can represent how similar cells are to one another. We would expect that cells with similar gene expression will have similar PC values. For example, we would anticipate that two Fibroblast cells would have comparable gene expression - which would also results in similar scores in principal components. This is why many of the downstream steps in the remainder of this lesson use PCA as the input.
Selecting PC Dimensions
To overcome the extensive technical noise in the expression of any single gene for scRNA-seq data, Seurat assigns cells to clusters based on their PCA scores derived from the expression of the integrated most variable genes. Each PC essentially representing a “metagene” that combines information across a correlated gene set. Determining how many PCs to include in the clustering step is therefore important to ensure that we are capturing the majority of the variation, or cell types, present in our dataset.
It is useful to explore the PCs prior to deciding which PCs to include for the downstream clustering analysis.
The elbow plot is a helpful way to determine how many PCs to use for clustering so that we are capturing majority of the variation in the data. The elbow plot visualizes the standard deviation of each PC, and we are looking for where the standard deviations begins to plateau. Essentially, where the elbow appears is usually the threshold for identifying the majority of the variation. However, this method can be quite subjective.
Let’s draw an elbow plot using the top 50 PCs:
# Plot the elbow plot
ElbowPlot(object = seurat_processed,
reduction = "pca.sketch",
ndims = 50)Based on this plot, we could roughly determine the majority of the variation by where the elbow occurs around PC8 - PC10, or one could argue that it should be when the data points start to get close to the X-axis, PC30 or so. This gives us a very rough idea of the number of PCs needed to be included.
We will be using 50 PCs for our downstream steps.
Grouping Similar Cells Together
Seurat uses a graph-based clustering approach using a K-nearest neighbor approach, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’ [Seurat - Guided Clustering Tutorial]. A nice in-depth description of clustering methods is provided in the SVI Bioinformatics and Cellular Genomics Lab course.
k-Nearest Neighbors
The first step is to construct a K-nearest neighbor (KNN) graph based on the euclidean distance in PCA space.
Image source: Analysis of Single cell RNA-seq data
- Edges are drawn between cells with similar features expression patterns.
- Edge weights are refined between any two cells based on shared overlap in their local neighborhoods.
This is done in Seurat by using the FindNeighbors() function. We also specify the reduction we want to use as well as which components (dims) will be used for the calculation. In this case, we will be using the first 50.
# Determine the K-nearest neighbor graph
seurat_processed <- FindNeighbors(seurat_processed,
assay = "sketch",
reduction = "pca.sketch",
dims = 1:50)Clustering
Next, Seurat will iteratively group cells together with the goal of optimizing the standard modularity function.
We will use the FindClusters() function to perform the graph-based clustering. The resolution is an important argument that sets the “granularity” of the downstream clustering and will need to be optimized for every individual experiment. For datasets of 3,000 - 5,000 cells, the resolution set between 0.4-1.4 generally yields good clustering. Increased resolution values lead to a greater number of clusters, which is often required for larger datasets.
The FindClusters() function allows us to enter a series of resolutions and will calculate the “granularity” of the clustering. This is very helpful for testing which resolution works for moving forward without having to run the function for each resolution.
Typically we would recommend using a variety of different resolutions and select a resolution that best suits your dataset. However, for the sake of time, we will proceed with a resolution = 0.65. We have more detailed explanations of how to test a variety of resolutions at one time.
seurat_processed <- FindClusters(seurat_processed,
cluster.name = "seurat_cluster.sketched",
resolution = 0.65)Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 10000
Number of edges: 478594
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8783
Number of communities: 17
Elapsed time: 1 seconds
If we take a look at the columns in our @meta.data, we see that the column seurat_cluster.sketched is a new column that was added.
seurat_processed@meta.data %>% head() %>% relocate("seurat_cluster.sketched") seurat_cluster.sketched orig.ident
P5CRC_s_008um_00602_00290-1 <NA> P5CRC
P5CRC_s_008um_00789_00234-1 <NA> P5CRC
P5CRC_s_008um_00728_00006-1 <NA> P5CRC
P5CRC_s_008um_00681_00396-1 <NA> P5CRC
P5CRC_s_008um_00078_00444-1 <NA> P5CRC
P5CRC_s_008um_00128_00278-1 <NA> P5CRC
nCount_Spatial.008um nFeature_Spatial.008um
P5CRC_s_008um_00602_00290-1 225 199
P5CRC_s_008um_00789_00234-1 76 71
P5CRC_s_008um_00728_00006-1 215 196
P5CRC_s_008um_00681_00396-1 74 71
P5CRC_s_008um_00078_00444-1 65 57
P5CRC_s_008um_00128_00278-1 1300 906
log10GenesPerUMI mitoRatio leverage.score
P5CRC_s_008um_00602_00290-1 0.9773277 0.04888889 0.09142904
P5CRC_s_008um_00789_00234-1 0.9842859 0.01315789 0.01895360
P5CRC_s_008um_00728_00006-1 0.9827724 0.04651163 0.20933028
P5CRC_s_008um_00681_00396-1 0.9903846 0.04054054 0.06665954
P5CRC_s_008um_00078_00444-1 0.9685377 0.23076923 0.02559120
P5CRC_s_008um_00128_00278-1 0.9496410 0.17307692 0.17665603
seurat_clusters
P5CRC_s_008um_00602_00290-1 <NA>
P5CRC_s_008um_00789_00234-1 <NA>
P5CRC_s_008um_00728_00006-1 <NA>
P5CRC_s_008um_00681_00396-1 <NA>
P5CRC_s_008um_00078_00444-1 <NA>
P5CRC_s_008um_00128_00278-1 <NA>
When we looked at the first few rows of our metadata, it appeared that there were many cells that do not have a cluster value.
Count how many NA’s are found for our cluster with table() function and use the argument (useNA = "ifany"):
table(seurat_processed$seurat_cluster.sketched, useNA = "ifany")
0 1 2 3 4 5 6 7 8 9 10
1656 1330 1021 1006 994 795 609 557 556 458 333
11 12 13 14 15 16 <NA>
189 167 99 93 77 60 816613
Why do you think there are so many NA values?
FindClusters() tends to automatically change the Idents() to the cluster values that were just calculated. If we take a look at our Idents we can see if that it has changed from orig.ident to seurat_cluster.sketched
Idents(seurat_processed) %>% head()P5CRC_s_008um_00602_00290-1 P5CRC_s_008um_00789_00234-1
<NA> <NA>
P5CRC_s_008um_00728_00006-1 P5CRC_s_008um_00681_00396-1
<NA> <NA>
P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1
<NA> <NA>
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
UMAP
To visualize the cell clusters, there are a few different dimensionality reduction techniques that can be helpful. The most popular methods include t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP) techniques.
Both methods aim to place cells with similar local neighborhoods in high-dimensional space together in low-dimensional space. These methods will require you to input number of PCA dimentions to use for the visualization, we suggest using the same number of PCs as input to the clustering analysis. Here, we will proceed with the UMAP method for visualizing the clusters.
seurat_processed <- RunUMAP(seurat_processed,
reduction = "pca.sketch",
reduction.name = "umap.sketch",
return.model = T, dims = 1:50)
seurat_processedAn object of class Seurat
36170 features across 826613 samples within 2 assays
Active assay: sketch (18085 features, 2000 variable features)
3 layers present: counts, data, scale.data
1 other assay present: Spatial.008um
2 dimensional reductions calculated: pca.sketch, umap.sketch
2 spatial fields of view present: P5CRC.008um P5NAT.008um
DimPlot(seurat_processed, group.by="orig.ident", reduction="umap.sketch")These UMAP coordinates are now stored in the dimensional reductions calculated: pca.sketch, umap.sketch. Now we can visualize with the clusters that we calculated earlier using DimPlot() and adding clearer labels withLabelClusters()`.
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "umap.sketch", label = T) +
ggtitle("Sketched clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)Project Back to Entire Dataset
Now that we have our clusters and dimensional reductions from our sketched dataset, we need to extend these to the full dataset.
Run ProjectData()
The ProjectData function projects all the bins in the dataset (the Spatial.008um assay) onto the sketch assay.
# Increases the size of the default vector
options(future.globals.maxSize= 2000000000)
seurat_processed <- ProjectData(
object = seurat_processed,
assay = "Spatial.008um",
full.reduction = "full.pca.sketch",
sketched.assay = "sketch",
sketched.reduction = "pca.sketch",
umap.model = "umap.sketch",
dims = 1:50,
refdata = list(seurat_cluster.projected = "seurat_cluster.sketched"))Using the sketch PCA and UMAP, the ProjectData function returns a Seurat object that includes:
| Category | Description |
|---|---|
| Dimensional reduction (PCA) | The full.pca.sketch reduction extends the PCA from sketched cells to all bins in the dataset |
| Dimensional reduction (UMAP) | The full.umap.sketch reduction extends the UMAP from sketched cells to all bins in the dataset |
| Cluster labels | The seurat_cluster.projected metadata column labels all cells with cluster labels from sketched cells |
We can now see the additional full-dataset reductions in the object.
seurat_processedAn object of class Seurat
36170 features across 826613 samples within 2 assays
Active assay: Spatial.008um (18085 features, 2000 variable features)
2 layers present: counts, data
1 other assay present: sketch
4 dimensional reductions calculated: pca.sketch, umap.sketch, full.pca.sketch, full.umap.sketch
2 spatial fields of view present: P5CRC.008um P5NAT.008um
Note that a score for the projection of each bin will be saved as a column in the metadata. Now the seurat_cluster.projected column shows values for every bin.
head(seurat_processed@meta.data) orig.ident nCount_Spatial.008um
P5CRC_s_008um_00602_00290-1 P5CRC 225
P5CRC_s_008um_00789_00234-1 P5CRC 76
P5CRC_s_008um_00728_00006-1 P5CRC 215
P5CRC_s_008um_00681_00396-1 P5CRC 74
P5CRC_s_008um_00078_00444-1 P5CRC 65
P5CRC_s_008um_00128_00278-1 P5CRC 1300
nFeature_Spatial.008um log10GenesPerUMI mitoRatio
P5CRC_s_008um_00602_00290-1 199 0.9773277 0.04888889
P5CRC_s_008um_00789_00234-1 71 0.9842859 0.01315789
P5CRC_s_008um_00728_00006-1 196 0.9827724 0.04651163
P5CRC_s_008um_00681_00396-1 71 0.9903846 0.04054054
P5CRC_s_008um_00078_00444-1 57 0.9685377 0.23076923
P5CRC_s_008um_00128_00278-1 906 0.9496410 0.17307692
leverage.score seurat_cluster.sketched
P5CRC_s_008um_00602_00290-1 0.09142904 <NA>
P5CRC_s_008um_00789_00234-1 0.01895360 <NA>
P5CRC_s_008um_00728_00006-1 0.20933028 <NA>
P5CRC_s_008um_00681_00396-1 0.06665954 <NA>
P5CRC_s_008um_00078_00444-1 0.02559120 <NA>
P5CRC_s_008um_00128_00278-1 0.17665603 <NA>
seurat_clusters seurat_cluster.projected
P5CRC_s_008um_00602_00290-1 <NA> 3
P5CRC_s_008um_00789_00234-1 <NA> 2
P5CRC_s_008um_00728_00006-1 <NA> 2
P5CRC_s_008um_00681_00396-1 <NA> 3
P5CRC_s_008um_00078_00444-1 <NA> 0
P5CRC_s_008um_00128_00278-1 <NA> 1
seurat_cluster.projected.score
P5CRC_s_008um_00602_00290-1 1.0000000
P5CRC_s_008um_00789_00234-1 0.9325007
P5CRC_s_008um_00728_00006-1 0.9965810
P5CRC_s_008um_00681_00396-1 1.0000000
P5CRC_s_008um_00078_00444-1 0.6967176
P5CRC_s_008um_00128_00278-1 1.0000000
Visualize Projected Clusters
We can now visualize our clusters from the projected assignments. The UMAP plot now contains more points, which is expected because we are now visualizing the full dataset rather than our 10,000 bin sketch. Nonetheless, we can see that the full dataset is still well-represented by the projected dimensional reduction and clustering.
# switch to full dataset assay
DefaultAssay(seurat_processed) <- "Spatial.008um"
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)To ensure that our cluster labels are listed in proper numeric order, we are going to make our seurat_cluster.projected a factor and set the levels such that the order is correct.
# Arrange so clusters get listed in numerical order
seurat_processed$seurat_cluster.projected <- seurat_processed$seurat_cluster.projected %>%
as.numeric() %>%
as.factor()
seurat_processed$seurat_cluster.projected %>% head()P5CRC_s_008um_00602_00290-1 P5CRC_s_008um_00789_00234-1
12 2
P5CRC_s_008um_00728_00006-1 P5CRC_s_008um_00681_00396-1
2 12
P5CRC_s_008um_00078_00444-1 P5CRC_s_008um_00128_00278-1
14 15
Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now we should see the levels in correct numeric order.
# Change the idents to the projected cluster assignments
Idents(seurat_processed) <- "seurat_cluster.projected"
# Plot UMAP
p <- DimPlot(seurat_processed,
reduction = "full.umap.sketch", label = T) +
ggtitle("Projected clustering")
LabelClusters(p, id = "ident", fontface = "bold", size = 5,
bg.colour = "white", bg.r = .2, force = 0)In order to see the clusters superimposed on our image we can use the SpatialDimPlot() function. We will also set the color palette and convert the cluster assignments to a factor so they are ordered numerically rather than lexicographically in the figure.
SpatialDimPlot(seurat_processed,
pt.size.factor = 7,
image.alpha = 0)Save!
Now is a great spot to save our seurat_processed object as we have added a lot new information into our Seurat object, such as:
- Sketch HVGs
- Scaled log-normalized counts
- PCA
- k-Nearest neighbors
- Clusters
- UMAP
- Projected clusters
# Save Seurat object
saveRDS(seurat_processed, "data/seurat_processed.RDS")